Even with the introduction of the dplyr package, it is often quicker to transform data using functions in base R. In this document, we will demonstrate how some of the functionality of dplyr can be replicated in base R. We’ll use the built-in mtcars datset as a running example.

library(dplyr)
data(mtcars)
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

select

In dplyr, the select function is used to choose the columns we want to keep. This functionality can be replicated easily in base R. Recall that a data frame can be viewed as a matrix as well; hence, we can pick out the columns we want by using square brackets and inserting a vector after the comma within the square brackets. For example, we we want the mpg and drat columns, we could do

df <- mtcars[, c("mpg", "drat")]
head(df)
##                    mpg drat
## Mazda RX4         21.0 3.90
## Mazda RX4 Wag     21.0 3.90
## Datsun 710        22.8 3.85
## Hornet 4 Drive    21.4 3.08
## Hornet Sportabout 18.7 3.15
## Valiant           18.1 2.76

If we knew the column numbers, we could simply specify that:

df <- mtcars[, c(1, 5)]
head(df)
##                    mpg drat
## Mazda RX4         21.0 3.90
## Mazda RX4 Wag     21.0 3.90
## Datsun 710        22.8 3.85
## Hornet 4 Drive    21.4 3.08
## Hornet Sportabout 18.7 3.15
## Valiant           18.1 2.76

To select all columns except those specified, add a minus sign in front of the vector:

df <- mtcars[, -c(1, 5)]
head(df)
##                   cyl disp  hp    wt  qsec vs am gear carb
## Mazda RX4           6  160 110 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       6  160 110 2.875 17.02  0  1    4    4
## Datsun 710          4  108  93 2.320 18.61  1  1    4    1
## Hornet 4 Drive      6  258 110 3.215 19.44  1  0    3    1
## Hornet Sportabout   8  360 175 3.440 17.02  0  0    3    2
## Valiant             6  225 105 3.460 20.22  1  0    3    1

mutate

To replicate mutate’s functionality, recall that under the hood, a data frame is a list, with the column names being keys and the values in the column being the values. Hence, we can use list notation to specify a new column.

For example, if we wanted to create a new column kml to denote kilometers per hour (instead of mpg miles per gallon), we could do

mtcars$kml <- mtcars$mpg * 1.609 / 3.785
head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
##                        kml
## Mazda RX4         8.927081
## Mazda RX4 Wag     8.927081
## Datsun 710        9.692259
## Hornet 4 Drive    9.097120
## Hornet Sportabout 7.949353
## Valiant           7.694293

Note that we have to type mtcars$mpg on the right instead of just mpg. That is because mpg in itself is not an object in our global environment.

We can use the same syntax to overwrite an existing column. For example, the original wt column gives the weight of the car in 1000 lbs. The code below changes the column to give the weight of the car in kilograms:

mtcars$wt <- mtcars$wt * 1000 / 2.2046
head(mtcars)
##                    mpg cyl disp  hp drat       wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 1188.424 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 1304.091 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 1052.345 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 1458.314 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 1560.374 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 1569.446 20.22  1  0    3    1
##                        kml
## Mazda RX4         8.927081
## Mazda RX4 Wag     8.927081
## Datsun 710        9.692259
## Hornet 4 Drive    9.097120
## Hornet Sportabout 7.949353
## Valiant           7.694293

If we want to define a new column based on more than one of the original columns, we can do so but have to keep typing out the name of the data frame:

mtcars$nonsense <- mtcars$vs + mtcars$am + mtcars$gear
head(mtcars)
##                    mpg cyl disp  hp drat       wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 1188.424 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 1304.091 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 1052.345 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 1458.314 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 1560.374 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 1569.446 20.22  1  0    3    1
##                        kml nonsense
## Mazda RX4         8.927081        5
## Mazda RX4 Wag     8.927081        5
## Datsun 710        9.692259        6
## Hornet 4 Drive    9.097120        4
## Hornet Sportabout 7.949353        3
## Valiant           7.694293        4

For more succinct code, we can use the with function: it allows us to type just the column names.

mtcars$nonsense <- with(mtcars, vs + am + gear)
head(mtcars)
##                    mpg cyl disp  hp drat       wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 1188.424 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 1304.091 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 1052.345 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 1458.314 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 1560.374 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 1569.446 20.22  1  0    3    1
##                        kml nonsense
## Mazda RX4         8.927081        5
## Mazda RX4 Wag     8.927081        5
## Datsun 710        9.692259        6
## Hornet 4 Drive    9.097120        4
## Hornet Sportabout 7.949353        3
## Valiant           7.694293        4

arrange

First, let’s restore the original mtcars dataset:

data(mtcars)

Arranging the rows in base R is a little complicated. Let’s say we want to order the dataset according to mpg in ascending order. We first call the order function on mtcars$mpg, which returns a vector telling us which row has the smallest mpg, then the second smallest mpg, and so on. From the result below, it looks like row 15 has the smallest mpg value, followed by row 16.

order(mtcars$mpg)
##  [1] 15 16 24  7 17 31 14 23 22 29 12 13 11  6  5 10 25 30  1  2  4 32 21
## [24]  3  9  8 27 26 19 28 18 20

To sort the whole dataset according to mpg, we use square brackets and insert the output of order before the comma within the square brackets:

mtcars[order(mtcars$mpg), ]
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1

Sorting in descending order can be done by specifying decreasing = TRUE in the order function:

head(mtcars[order(mtcars$mpg, decreasing = TRUE), ])
##                 mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2

order can take more than one argument: if two rows tie on the first argument, the second argument is used to break ties, and so on. The code below sorts the rows by cyl, then vs, then mpg:

mtcars[with(mtcars, order(cyl, vs, mpg)), ]
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2

filter

If I wanted to look at just the cars with 8 cylinders (i.e. cyl == 8), I could do it the following way using dplyr (notice the double equal sign to test for equality; signal equal sign represents assignment instead):

mtcars %>% filter(cyl == 8)
##     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1  18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## 2  14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## 3  16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## 4  17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## 5  15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## 6  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## 7  10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## 8  14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## 9  15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## 10 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## 11 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## 12 19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## 13 15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## 14 15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8

To do this in base R, we first need to know another way of extracting elements from a vector. Consider the vector below:

vec <- 1:3

To extract a group of elements from vec, we previously used square bracket notation, with a vector of indices that we wanted to extract:

vec[c(1,2)]
## [1] 1 2

Another way to extract elements is by putting a logical vector of the same length in the square brackets. R will then extract those elements which match to TRUE. For example, the code below extracts the first and third elements:

vec[c(TRUE, FALSE, TRUE)]
## [1] 1 3

To extract all the observations with exactly 8 cylinders, we can do this:

df <- mtcars[mtcars$cyl == 8, ]
table(df$cyl)
## 
##  8 
## 14

To extract observations with less than 8 cylinders:

df <- mtcars[mtcars$cyl < 8, ]
table(df$cyl)
## 
##  4  6 
## 11  7

To extract observations such that the number of cylinders is not 8:

df <- mtcars[mtcars$cyl != 8, ]
table(df$cyl)
## 
##  4  6 
## 11  7

summarize and group_by

If we wanted the mean mpg for the table, we would do the following in dplyr:

mtcars %>% summarize(mean_mpg = mean(mpg))
##   mean_mpg
## 1 20.09062

In the above, mean_mpg is the name of the new column, while mean(mpg) is the function to be applied to the data frame. In base R, we would just apply the function directly:

mean(mtcars$mpg)
## [1] 20.09062

dplyr’s summarize function has the power to perform multiple summaries at the same time and have the result be a data frame as well. There is no base R equivalent for this functionality.

dplyr’s summarize function also works really well with group_by to give summaries for subgroups. For example, it is easy to get the mean mpg value for cars of different cyl values:

mtcars %>% group_by(cyl) %>%
    summarize(mean_mpg = mean(mpg))
## # A tibble: 3 x 2
##     cyl mean_mpg
##   <dbl>    <dbl>
## 1     4     26.7
## 2     6     19.7
## 3     8     15.1

It turns out that base R can do this too, but such code is rare nowadays:

by(mtcars, mtcars$cyl, function(x) mean(x$mpg))
## mtcars$cyl: 4
## [1] 26.66364
## -------------------------------------------------------- 
## mtcars$cyl: 6
## [1] 19.74286
## -------------------------------------------------------- 
## mtcars$cyl: 8
## [1] 15.1